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.1. Introduction 



Solution of the crystallographic phase problem by 
iterated projections 

Veit Elser * 

Department of Physics, Cornell University, Ithaca, NY 14853-2501 USA. Correspondence e-mail: vel0@cornell.edu 

An algorithm for determining crystal structures from diffraction data is described 
which does not rely on the usual Fourier-space formulations of atomicity. The 
new algorithm implements atomicity constraints in real-space, as well as intensity 
constraints in Fourier-space, by projections which restore each constraint with the 
minimal modification of the scattering density. To recover the true density, the 
two projections are combined into a single operation, the difference map, which 
is iterated until the magnitude of the density modification becomes acceptably 
small. The resulting density, when acted upon by a single additional operation, is 
by construction a density which satisfies both intensity and atomicity constraints. 
Numerical experiments have yielded solutions for atomic resolution x-ray data 
sets with over 400 non-hydrogen atoms, as well as for neutron data, where posi- 
tivity of the density cannot be invoked. 



'This year marks the semi-centennial of the realization by crys- 
tallographers that diffraction intensities possess sufficient infor- 
mation to reconstruct an atomistic structure (Sayre, 2002). The 
simple fact that the scattering arises from a known number of 
nearly point-like entities, while clearly not as intricate in con- 
tent as the body of collected intensities themselves, is by itself 
a significant piece of information. The first important steps in 
utilizing atomicity in structure determination where taken by 
Sayre (1952) in his celebrated equation, and later by Karle and 
Hauptman (1953) in their probabilistic analysis of structure fac- 
tors. What is remarkable in the subsequent fifty-year history of 
direct methods, especially in view of the development of the 
FFT already in the mid 1960s, is that atomicity has always been 
imposed in Fourier-space. The efficiency of the transformation 
to real-space, made possible by the FFT, might have ushered 
in an era where atomicity was imposed in the space where it 
is most naturally expressed. While the most successful direct 
method programs, such as SnB (Miller et al , 1994; Weeks 
& Miller, 1999) and SHELXD (Sheldrick, 1997, 1998), have 
adopted a significant degree of atomicity intervention in real- 
space, the traditional Fourier-space approach to atomicity has 
continued to be dominant in the development of algorithms. 

The aim of the work detailed below was to develop a prac- 
tical phase determination algorithm for crystal structures that 
imposed atomicity entirely within real-space. A key component 
of the algorithm is an iterative operation (difference map) that 
was discovered by deconstructing the most successful algorithm 
(hybrid input-output) for the phase problem in optics (Fienup, 
1982) and reexpressing it in terms having wider applicability 
(Elser, 2002a). Experiments with the atom.retriever im- 
plementation of the new algorithm on a variety of test structures 
demonstrate both its robustness and speed. The flexibility of the 
new approach, with respect to the kinds of constraints that can 
be imposed in real-space, raises hopes of an ab initio solver not 



limited by atomic resolution data. 
2. Constraints and projections 

The choice of the algorithm's fundamental variables is largely 
motivated by the mathematical structure of the iterative step 
(Section 3.). In particular, the object that is iterated should have 
the property that it can be added, in the sense of a linear vec- 
tor space, to other objects, and that there is a natural expression 
for the distance between objects. The unknown Fourier phases, 
for example, are not good candidates in this respect. A better 
choice, and the one we adopt, is the real-space scattering den- 
sity sampled on a finite regular grid. The relationship between 
real-space sampling and Fourier-space sampling on the recipro- 
cal lattice is quite direct, as illustrated by the two dimensional 
example in Figure 1. Shown on the left (a) is the actual scat- 
tering density within one unit cell. The structure factors of the 
corresponding crystal decay with scattering angle so that only a 
limited range about the origin in reciprocal space are measured. 
By padding with zeroes at the corners, the Fourier-space mea- 
surements can be fit into a finite rectangular grid as shown in 
the middle figure (b). Given phases for the structure factors on 
the bounded Fourier-space grid, the discrete Fourier transform 
of the resulting complex structure factors then gives the dis- 
cretely sampled real-space density shown on the left (c). Con- 
versely, given a scattering density on the real-space grid (c), the 
inverse Fourier transform gives the complex structure factors on 
the bounded Fourier-space grid, although with magnitudes not 
necessarily matching the measurements (b). 

2.1. Intensity constraints 

A valid density in real-space must first of all have the prop- 
erty that the inverse Fourier transform gives the measured struc- 
ture factor magnitudes. If not, one can seek the minimal density 
modification that brings the actual magnitudes into agreement 
with the measured ones. Using the symbol p to represent the 
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vector of densities on the real-space grid, the mathematical op- 
eration which accomplishes this is the projection nF(p). The 
projected density is uniquely defined by the properties that its 
Fourier transform has the correct (given) magnitudes and the 
distance ||nF(/9) — is minimized. It is convenient to use the 
Euclidean distance since it is preserved by Fourier transforma- 
tion: 

\\pf = Y..pI = Y.M = \\~pf ■ (1) 

In (1) the indices r and q denote grid points in real-space and 
Fourier-space (reciprocal lattice), respectively, and the complex 
structure factors pq are related to the real-space density by 



(2) 



where M is the total number of grid points (in real or Fourier 
space). Using the unit cell's fractional coordinates (x,>', z) 
to label r and the Miller indices {h,k,l) for q, we have 
q - r = hx + ky + Iz. The invariance of the distance (1) makes it 
possible to achieve the distance minimizing property of the pro- 
jection lip very easily in Fourier- space. Specifically, for every 
complex structure factor we wish to find the nearest point in 
the complex plane lying on a circle corresponding to the mea- 
sured magnitude Fq. The required projection is therefore accom- 
plished by simply rescaling the magnitude of the structure fac- 
tor: 

(U^p)) =7^Pq. (3) 
V /q iPql 

A vanishing denominator in (3) is not a problem since it repre- 
sents a set of measure zero if one neglects extinctions (where Fq 
also vanishes). The projection in real-space is expressed sym- 
bolically as: ^ 

nF = j^-'nFJ^. (4) 

When the Fourier transforms are implemented with the FFT, 
the computational cost of projecting a density on the intensity 
constraints grows as M log M. 

A practical algorithm using the Fourier intensity projection 
lip must address the fact that not all the structure factors on 
the Fourier-space grid will be measured. In addition to Fq, mea- 
surements near q = will frequently be absent or very unre- 
liable, particularly for large unit cell crystals. At the other ex- 
treme, structure factors in the corners of the Fourier-space grid 
(Fig. 1(b)) will be absent because data is normally collected 
within an ellipsoidal domain about the origin. The absent struc- 
ture factors can be treated in a uniform way by applying bound 
constraints rather than value constraints. A bound |pq| < in 
Fourier-space is geometrically a disk, and the projection which 
restores the bound constraint either leaves the structure factor 
unchanged, when it is inside the disk, or moves it to the nearest 
point on the circumference, when it lies outside. Using T) to de- 
note the set of grid points q for which measured values Fq exist, 
and assuming bounds F^ can be found for all others, equation 
(3) should be replaced by: 

, r(Vl^ql)Pq forqeT? 
(U^ip)) = { (^qVlPql)Pq for q ^ P and |pq| > F^ (5) 
Ipq forq^r»and|pq| <FB 



At large q one can obtain reliable bounds by extrapolating the 
measured structure factors on a Wilson plot (see Sec. 4.2.). Near 
the origin, where usually only few structure factors are absent, 
an infinite bound is usually adequate and avoids a more diffi- 
cult estimation problem. An example of a Fourier-space grid, 
showing structure factor values and bounds taken from data for 
a 148-atom peptide structure (Table 1 , ref . 5), is shown in Figure 
2. 

2.2. Atomicity constraints 

Atomicity can be imposed as a support constraint, where the 
support S of the density is a subset of the grid points in real- 
space with the property pt ^ implies r e 5. However, in 
contrast to phase retrieval with nonperiodic objects, where 5 is 
known or can be bounded, in crystallography one only knows 
that S has atomic characteristics. The simplest definition of an 
atomic support is the union of a known number of compact sub- 
sets of grid points, each representing one atom, and having ar- 
bitrary locations within the unit cell. Given a particular atomic 
support S, the projection Us of an arbitrary density p, to a min- 
imally modified density having support S, is simply 



for r G 5 
otherwise. 



(6) 



If A denotes the collection of all atomic supports (differing in 
atom locations), then atomicity projection is defined by 



nA(p) = ny(p) 



(7) 



where 5" e >l is the atomic support that minimizes ||n5'(p) — p|| 
over all S & A. While (6) can be computed quickly, an ex- 
haustive search over all atomic supports in A to find (7) may 
be prohibitive. We therefore adopt a heuristic (described below) 
that quickly finds an atomic support that is usually optimal. 

In describing the precise projection operation we distinguish 
two cases: atomicity projection for positive atoms (A-h), and 
atomicity projection for atoms of arbitrary sign (A). The for- 
mer is used with x-ray diffraction data, the latter with neutron 
diffraction data when atomic species with both signs of scatter- 
ing length are present. In both cases we assume the number of 
atoms per unit cell is known. In the case of x-ray diffraction 
this will usually not include H atoms. 

The first step in computing the projection (p) is to sort 
the density values on the real-space grid. Then, beginning with 
the largest density, grid points with the property of being a local 
maximum are identified. A local maximum is defined by hav- 
ing a larger density value than any of its 26 neighboring grid 
points. Each time a local maximum is found, the 27 density val- 
ues (maximum + neighbors) are copied, after positivity projec- 
tion, onto a real-space output grid that was initially set to zero. 
Positivity projection, given by 



(n+(p))r 



Pt if Pr > 

otherwise. 



(8) 



research papers 



is the minimal modification that restores the positivity of atoms. 
The search through the sorted densities terminates when lo- 
cal maxima have been identified and copied (positively) into 
the output grid. A graphical example of nA+ in two dimensions 
(and 8 neighbors) is shown in Figure 3(b). 

Two modifications are required to compute the projection to 
atoms of arbitrary sign, YIa{p). To identify large peaks of ar- 
bitrary sign, the densities on the real-space grid are sorted by 
absolute value. Then, densities in the sorted list are identified 
as local maxima or minima, depending on their sign. Positivity 
projection is still applied to local maxima, whereas local min- 
ima are subjected to its counterpart, negativity projection. Fig- 
ure 3(c) shows the action of FIa. The computationally cost of 
both types of atomicity projection is dominated by the sort of 
the M densities on the real-space grid. Using the quicksort al- 
gorithm this cost grows as MlogM, proportional to the cost of 
Fourier intensity projection (Section 2.1.). 

The atomicity projections described require data with suffi- 
cient resolution. From the conventional definition 

(resolution) = d^in = In j gmax , (9) 

where Q = (27r/A)2 sin Q is the magnitude of the physical scat- 
tering wavevector, one can obtain in physical units the real- 
space grid spacing. The relationship between Q and the vector 
of Miller indices q is given by 

(e/27r)2 = q • M • q ^ {hjaf + (V^)^ + {Ijcf , (10) 

where the matrix M is a metric constructed from the unit cell 
parameters, and the last expression gives the explicit form for 
an orthorhombic cell with dimensions a, b and c. For the ranges 
on the Miller indices to be consistent with a 2max. equation (10) 
shows in particular that 

\h\ < fl(gmax/27r) = a/d^in . (11) 

The number of Fourier- space grid points for the index h is there- 
fore 2{a/dmm), and since the real-space grid has the same num- 
ber of points and has physical dimension a, the grid spacing is 

According to our projection heuristic, a pair of local extrema 
(of the same sign) can never be neighbors on the grid, but must 
be separated by at least two grid spacings in one of the three 
dimensions. Supposing for simplicity that the unit cell is nearly 
orthorhombic, the most problematic relative displacement for a 
pair of atom centers is along the body diagonal of the grid. In 
that case the displacement must exceed (2, 2, 2) in grid units, 
otherwise the corresponding local extrema might be neighbors 
on the grid. The minimum separation of atomic centers must 
therefore satisfy rmin > 2\/3 in grid units, or fmin > V^dmin- 
Since for organic structures r^m ~ 1 •4A (neglecting H atoms), 
this statement implies dmin < 0.81 A. On the other hand, this 
bound is derived from the worst-case placement (relative to the 
grid) of two atoms in a structure of many atoms. Atom pairs 
displaced along a grid axis, for example, yield the more gen- 
erous bound Jmin < ''min- It is therefore not surprising that this 
form of atomicity projection has succeeded in solving organic 
structures at resolutions exceeding O.8IA (Table 1). 



3. The difference map 

Given two sets of constraints on the density, implemented re- 
spectively by projections Hi and 112, the difference map is an 
iterative procedure for obtaining a solution density psoi that sat- 
isfies both constraints, specifically: 

111 (psol) = Psol = n2(psol) • (12) 

Although our main interest is the projections Hi = Ha (or 
nA+) and 112 = Hp, we begin with a review of the solution 
method for a general pair of projections (Elser, 2002a). 

3.1. Fixed points and solutions 

Starting with an arbitrary initial density p{0), a sequence of 
iterates p{n) = D" {p{0)) is generated by repeated application 
of the difference map: 

D:p^p + P{Uiof2-U2of,){p). (13) 

Each of the projections in (13) is composed with a map 

/; = (l+70n,-7,- a =1,2); (14) 

/? 7^ 0, 71 and 72 are real parameters. The difference map has 
two key properties, the first being that a solution, as defined by 
(12), exists if and only if the map has a fixed point, p* = D{p*). 
To see this, note that the difference in (13) vanishes at a fixed 
point, hence: 

nio/2(p*) = n2o/i(p*) = p' . (15) 

Applying either of the projections to (15) and using the property 
n, o n, = n„ we obtain 

ni(p') = p' = n2(p'), (16) 

thus identifying p' with psoi- Conversely, if psoi exists, the 
set of fixed points is nonempty since it is easily verified that 

The fixed point property of the difference map makes no ref- 
erence to the detailed forms of the maps f,. These maps are key 
to the second property: the attractive nature of the fixed points. 
In an iterative solution method a fixed point is useless unless 
it is attractive; moreover, the greater the basin of attraction, 
the more effective is the method. Replacing the /} by identity 
maps, for example, creates unstable (repulsive) directions in a 
fixed point's local behavior (Elser, 2002a), effectively reducing 
to zero the probability of arriving at the fixed point. The chosen 
form (14) of the maps fi is the simplest, involving just the pro- 
jections, that for suitable values of the parameters 7; renders the 
fixed points of the difference map attractive. 

Fixed points of the difference map should not be confused 
with the solution. The former are not unique, comprising in 
fact a submanifold in the space of densities; formally, the 
submanifold of fixed points is given by the intersection of in- 
verse images: 

(Hi o/2)-i(p,oi) n (n2o/i)-'(Psol) . (17) 
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During the course of iterating the difference map the conver- 
gence to a fixed point is assessed by the norm of the difference: 

en = ||(nio/2-n2o/i)(p(n))|| . (18) 

When e„ becomes acceptably small, psoi is obtained, as in (15), 
by applying 111 o /2 (or 112 o /i) to the estimate p* « 

3.2. Parameter values 

For a particular /3, interpreted as a step size, the values of 71 
and 72 are selected to optimize the convergence at fixed points. 
The earliest analysis of the difference map (Elser, 2002a) as- 
sumed local orthogonaUty of the two constraint subspaces and 
found 

71 = (19) 

72 = 1//3. (20) 

Subsequent work (Elser, 2002b) considered an average-case 
analysis for particular kinds of constraints, including atomicity, 
and found optimal parameters (1 — > S, 2 — > F) 

7s = (21) 

1+?f(1-/3) ,,,, 
7f = ^ , (22) 

where fp ~ 0.5 is the fraction of Fourier-space grid points with 
known (non-negligible) structure factors, and the projection S 
(support) corresponds to atomicity (A or A+). However, due to 
the fact that the typical step size is /? « 0.7, the small numeri- 
cal difference between (20) and (22) has not led to a significant 
change in performance of the algorithm. All experiments quoted 
in this work used the simpler expression (20). 

Since there is as yet no theory for determining the optimal 
value of /? for any particular application, remains the single 
parameter of the algorithm that must be optimized empirically. 
Average solution times, measured in terms of difference map it- 
erations, are shown in Figure 4 as a function of /? for a 148-atom 
peptide structure (Table 1, ref. 5). Systematic trends in optimal 
P values with data resolution and M/N = (gridsize)/ (atoms) 
have not been performed, although it appears that optimal val- 
ues fall in the range shown (0.4 < /? < 0.8). 

Some algorithms (SnB, SHELXD) treat the number of non-H 
atoms as a parameter, with N deliberately chosen significantly 
smaller than the best estimate of the actual number to improve 
performance. Preliminary studies with the difference map al- 
gorithm on the 148-atom peptide structure showed only a very 
weak variation in the average number of iterations when was 
varied by ±8%. This result indicates that while is not a useful 
parameter, the algorithm can tolerate inevitable uncertainties in 
the actual numbers of atoms. From a logical viewpoint, choos- 
ing A' larger than the actual number of atoms is still valid: the 
atomicity constraint has simply been weakened. 

3.3. Convergence with imperfect data 

When formulated in terms of constraints, the uniqueness of 

solutions to the phase problem requires an overconstrained situ- 
ation. Considered geometrically, the two sets of constraints (say 



intensity and atomicity) are individually submanifolds in the 
space of densities having relatively low dimensionality. Specif- 
ically, in the overconstrained case the sum of the dimensional- 
ities is less than that of the ambient space (M), such that the 
intersection of generic submanifolds, of the same dimensions, 
would be empty (rather than a submanifold of positive dimen- 
sion). The constraint submanifolds in a well-posed phase prob- 
lem, given perfect data, are nongeneric in the sense that a solu- 
tion is known to exist, or equivalently, the submanifolds have 
a nonempty intersection in spite of their low dimensionaUty. 
The fine-tuning imphcit in the intensity data (say) required to 
achieve an intersection, or true solution, is upset by practically 
any departures from ideaUty. Chief among these in the crystal- 
lographic phase problem are statistical errors in the intensity 
measurements and the neglect of hydrogen atoms in the treat- 
ment of atomicity. Faced with these reaUties, one must abandon 
the hope of finding a solution in the strict sense. 

Although the constraint submanifolds are not expected to per- 
fectly intersect with realistic noisy data, we expect them to have 
a small separation (in the space of densities) in the vicinity of 
the true density. The convergence estimate e (see (18)) can be in- 
terpreted as the currently achieved distance between constraint 
submanifolds, and solutions should be identified not by its van- 
ishing but by its value dropping a significant amount. Plots of 
e„ as a function of iteration n are contrasted in Figure 5, for 
synthetically generated (top) and experimental (bottom) data. 
Experimental data for the 148-atom peptide structure (Table 1, 
ref. 5) was used for the imperfect data set, and synthetic data 
for a 148-equal-atom structure having the same M/N ratio was 
used to simulate perfect data. To ensure perfect compliance with 
the atomicity constraint, the Gaussian atoms used to create the 
perfect data where given supports on 3 x 3 x 3 subsets of the 
real-space grid; atom centers, given by a random number gener- 
ator, avoided the minimum separation ^nun = a/T2 (grid units). 
The sharp drop in e displayed by the experimental data, and 
observed in all difference map solutions reported here, demon- 
strates the viabiUty of e as a solution criterion even in the case 
of imperfect data. 

With imperfect data it appears that the minimum e occurs 
shortly after the initial drop, and is not surpassed subsequently. 
As a best estimate of a fixed point density we therefore take the 
difference map iterate at the e minimum; the corresponding so- 
lution estimate is found by applying nFo/A+ or (FIfo/a). Using 
this procedure on the known 148-atom peptide structure gave a 
mean figure-of-merit (cos A</)) = 0.71 when averaged over all 
reflections in the data. 

The overconstrained nature of the problem solved by the dif- 
ference map can be appreciated by closer examination of the 
solution found with perfect data. Shortly after the sharp drop 
in e when the solution is first found, e decays monotonically to 
zero. This behavior implies that the structure factors of the den- 
sity at large angles, that were provided only as bounds to the 
algorithm, are in fact being extrapolated to their true values by 
the iterative process (as was confirmed by direct examination of 
the solution's structure factors). 
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3.4. The solution process 

The problem of finding a point in Euclidean space that sat- 
isfies a number of constraints, or showing that no such point 
exists, is known as a feasibility problem in the optimization 
literature. Theoretical studies have mostly focused on the case 
of convex constraint subspaces, where monotonicity of conver- 
gence can usually be proven for a variety of iterative methods. 
However, since both sets of constraints in the crystallographic 
phase problem (Fourier intensity and atomicity) are nonconvex, 
no rigorous results are available. The local analysis of the dif- 
ference map, quoted above, only establishes the favorable fixed 
point characteristics of the map, and provides no estimate on 
the number of iterations required to enter a fixed point's sphere 
of influence. 

A dynamical systems perspective, combined with empirical 
data, provides a useful, though nonrigorous, picture for the dif- 
ference map's mode of operation. Some salient features of the 
evolution of the density are illustrated in Figure 6. Relatively 
rapid changes occur on the scale of very few iterations, and con- 
tinue in an apparent steady- state until quite abruptly the fixed 
point is encountered. During the long period of rapid changes 
there is no obvious progress toward the solution and the dy- 
namics is weU characterized as chaotic in the strongly mixing 
regime, where iterates settle into a stationary probability distri- 
bution very quickly. The basin of attraction of the map's fixed 
points has some overlap with this probability distribution, the 
magnitude of which determines the mean number of iterations 
required to arrive at the solution when averaged over starting 
points. 

Taking this interpretation as a hypothesis, it can be tested by 
compiling a distribution of solution times for a given problem 
instance, as measured by the number of iterations before the 
sharp drop in e occurs (Fig. 5). If the strongly mixing property 
holds, then iterates are effectively subject to a probability of ar- 
riving at the attractive basin of a fixed point that is constant in 
time, and hence solution times will have an exponential distri- 
bution. An experiment comprising 3800 trials for the 148-atom 
peptide structure (all with (3 — 0.7) showed exactly this distri- 
bution. A solution was found in each trial, the longest requiring 
7760 iterations, and the mean for all trials was 1 100 iterations. 
The distribution of iterations, normalized relative to the mean, is 
plotted in Figure 7 and compared with the exponential distribu- 
tion. Overall the agreement is very good: the slight deviation at 
small iterations can be explained by a combination of the fixed 
(but small) number of iterations required to converge, first to 
the stationary probability distribution, then, after arriving at the 
attractive basin, to the fixed point. 

The observed distribution of solution times greatly simpli- 
fies the solution protocol and eliminates yet another poten- 
tial parameter: the bound on the number of iterations. Iteration 
bounds, typically some multiple of the number of atoms A'^, are 
imposed by SnB and SHELXD and results are quoted in terms 
of "success rates". Given an exponential distribution of solu- 
tion times (Fig. 7), such a bound for the difference map algo- 
rithm is arbitrary since it will have no effect on the number of 
solutions found per total iterations performed on all trials. Ex- 



pressed in more direct terms: the performance of the algorithm 
is practically unaffected by random restarts, and hence there is 
no degradation of performance when iterations are allowed to 
continue indefinitely. It is possible, however, that this conclu- 
sion will have to be modified if further experimentation with 
the difference map, say with smaU (3, finds a nonexponential 
distribution of solution times. 

4. Studies of test structures 

4.1. The atom_retriever computer program 

A preliminary implementation of the difference map algo- 
rithm for crystallographic applications exists as the C-language 
program atom_retriever. The software in its current form 
is best characterized as a library of general-purpose subroutines 
that manipulate data in the uniform format of discretely sampled 
densities. At the lowest level are subroutines for performing a 
variety of projections, Fourier intensity and atomicity projection 
being of primary interest to crystallographers. The next level of 
subroutines combine the chosen projections into the difference 
map. Finally, at the highest level are a collection of drivers and 
translators, that provide options for monitoring and terminating 
iterations, as well as converting input structure factor data files 
into the rectangular arrays used by the algorithm. With this de- 
gree of transparency, it is hoped that users will experiment with 
innovations at the level of the projections, which are at the heart 
of the algorithm's success. 

4.2. Space groups and structure factor input 

The primary input to atom_retriever is the rectangular 
array, containing structure factor values and bounds (Fig. 2), 
and used by the Fourier intensity projection subroutine. Soft- 
ware for applying symmetry elements to structure factor data in 
the construction of these arrays is still being developed, limit- 
ing applications to structures with triclinic (PI and PI) space 
groups. The complete set of space groups can be implemented 
by preparing the initial density p(0) with a projection that rec- 
ognizes in Fourier space the phase relationships between struc- 
ture factors for the specified group; no changes are necessary in 
the atom_retriever program, since, apart from numerical 
rounding effects, the difference map preserves the symmetry of 
the density. Since symmetry projection software is also under 
development, the PI structures studied to date were treated as 
P 1 , with twice the true number of symmetry inequivalent atoms. 

The truncation of the individual atomic supports during atom- 
icity projection to a 3 x 3 x 3 array of grid points was shown by 
Elser (2002a) to be optimal when the corresponding Gaussian 
atom has a mean square displacement (m^) « 0.55 in grid units, 
or 

B/^Tz' = {ul)^Q.Udl^, (23) 

where B is an effective isotropic temperature factor. Defining a 
dimensionless temperature factor by 

b = B/dl^ , (24) 

one finds that typical x-ray and neutron data sets (see Table 1) 
satisfy b < bopt « 1 1 . This means that 3 x 3 x 3 is usually a 
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generous support, perhaps even sufficient to accommodate hy- 
drogen neighbors. 

An effective temperature factor B, which combines the ef- 
fects of atomic size, thermal vibration and certain kinds of static 
disorder, is estimated from the data by making a linear least- 
squares fit of pairs {Q^, log \Fq\^} to the form 

log|Fe|2=A-iB(e/27^)^ (25) 

for measurements in a restricted range Q > Qq, where typically 
27r/2o = I.2A. At large spatial frequencies the structure fac- 
tors are well modeled as complex Gaussian random variables 
(isotropic with mean 0) and hence the distribution of intensities 
at wave vector Q is given by 

Pq{I) dl = exp i-I/{lQ))dI/{lQ) . (26) 

Since a least-squares fit applied to (25) gives a formula for the 
average of log Iq, the distribution (26) gives the result 

A - ^B{Q/27rf = (log/e) = log {Iq) - 7 , (27) 

where 7 is Euler's constant. 

The probability distribution (26) is also used in the determi- 
nation of bounds on the magnitudes of unmeasured structure 
factors with Q > 2o, the bulk being in the comers of the 
Fourier-space grid where Q > In/dmm- If M' is the number 
of such (synunetry inequivalent) structure factors, then 1/M' is 
an acceptable probability for an actual intensity to exceed the 
bound. Equating this with the probabiUty / > {Fq)^ computed 
using (26), gives 

F| = ^{Iq) log M' , (28) 

where {Iq) is given explicitly in terms of the parameters A and 
B of the fit by (27). There are far fewer urmieasured structure 
factors with Q < Qq, and the bound Fg = 00 was used for 
these in all the studies reported here. 

4.3. Discussion of tests 

The results of a selection of tests of the atom.retriever 
program are summarized in Table 1 . In each test the entire avail- 
able set of experimental structure factors was used, with miss- 
ing data replaced by bounds in the manner described above; 
the grid dimensions correspond to (2|/i|niax + 2) x (2|A:|max + 
2) X (2|Z|max + 2). No effort was made to individually optimize 
performance with respect to /?; the value chosen, /3 = 0.7, is 
near the minimum in the average number of iterations for the 
148-atom peptide structure (see Fig. 4). Mean figures of merit 
(cos A0) for the PI structures were determined relative to cal- 
culated phases. For the PI structures, where calculated phases 
were not available, an internal figure of merit was determined by 
treating the structures as PI (with twice as many atoms) and tak- 
ing the nearest set of centrosymmetric phases as the true phases. 
When the number of iterations required by the algorithm has a 
broad distribution (see Fig. 7), the average number is quoted. 



Atomicity projection for positive atoms (nA+) was used in all 
the x-ray data sets; the neutron data set for the mineral monte- 
brasite provided the sole appUcation of the projection to atoms 
of arbitrary sign, FIa. Nuclei with negative scattering length, 
such as Li, show up as light contrast in a field of dark atoms 
in plots of the scattering density psoi- Figure 8 shows the scat- 
tering density in one layer of the montebrasite solution. The 
small number of iterations required to find the solution is typi- 
cal of few-atom structures, where the error diagnostic (18) de- 
creases almost monotonicaUy with iteration (Fig. 8). When ap- 
propriately translated, the scattering density was nearly cen- 
trosymmetric, the resulting internal figure of merit being signifi- 
cantly larger than what would be obtained from random phases: 

(cos A(/)) rand = 2/tT « 0.64. 

Rapid solutions with a near monotonic error decrease were 
also observed for the other few-atom structures: nitramine, pyr- 
role, punctaporonin and triphenylphosphine. The nitramine and 
pyrrole structures were selected for their high density and cell 
aspect ratios respectively. These characteristics had no notice- 
able effects on the algorithm's performance. Triphenylphos- 
phine, interestingly, has more atoms (when treated as PI) than 
the 148-atom peptide structure which requires many more it- 
erations. Data resolution is not the cause of this anomaly, as 
was confirmed by truncating the triphenylphosphine data to the 
same 0.96 resolution as the peptide: the solution was again 
found in only 25 iterations. A more likely cause is the presence 
of a moderately heavy P atom in each of the eight triphenylphos- 
phine molecules of the structure, in contrast to the near equality 
of the non-H atoms of the peptide. Arbitrary atomic charges 
are trivially acconomodated by the form of atomicity projec- 
tion used by atom.retriever. A comparably non-specific 
atomicity was achieved relatively late in the development of the 
Fourier-space based framework (Hauptman, 1976; Rothbauer, 
2000). 

It is premature to assess the algorithm's prospects in solving 
large structures from atomic resolution data. The average solu- 
tion time for the 148-atom peptide was 35 seconds on a 2GHz 
Pentium 4 (single processor), not far behind SnB2 .2 (30 sec- 
onds) and SHELXD (11 seconds). More critical in evaluating 
performance is the growth in the average number of iterations 
with structure size. The largest structure attempted, a synthetic 
a-helical bundle, required nearly a half-million iterations per 
solution (about 30 hours). Figure 9 shows the error estimate for 
a typical run, together with electron density contours obtained 
directly from the discretely sampled density psoi- The single, no- 
ticeably larger peak in the density was identified with the chlo- 
ride ion in the structure. This synthetic peptide differs from the 
smaller structures in that the number of ordered, non-H atomic 
sites (AO is not known a priori because of the solvent contribu- 
tion. The results quoted all used = 479, the number of non-H 
atoms discovered in the original refinement (Prive et al , 1999), 
but the structure was also solved with N as small as 440. 

5. Conclusions 

The crystallographic phase problem consists of two, logically 
distinct, though technically coupled, parts. Using the terms 
"needles" for solutions and "hay" for non-solutions, these parts 
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are: (1) distinguishing needles from hay, and (2) finding the 
proverbial needle in a haystack. As this work hopefully demon- 
strates, it is quite straightforward to recognize needles when 
presented with one: one only requires projections that act triv- 
ially (with negligible change) when operating on an object 
(i.e. scattering density) that satisfies all the known constraints 
(Fourier intensity, atomicity). If the constraints are too weak, as 
in a low resolution data set, the phase problem may not be sol- 
uble in principle, because needles cannot be distinguished from 
the hay. On the other hand, even for low resolution data it is 
known (Podjamy et al , 1987) that in certain circumstances nee- 
dles can still be recognized. A properly phased protein crystal, 
for example, will often have a well defined solvent region and a 
characteristic histogram of density values (possibly at multiple 
length scales) within the body of the molecule, even when indi- 
vidual atoms are not resolved. If a projection (minimal density 
modification) can be constructed that applies in this more gen- 
eral setting, the first part of the phase problem can be said to be 
solved. 

The difference map solves the second part of the phase prob- 
lem by providing a uniform scheme for combining the applica- 
ble projections into an algorithm that finds the needle. Its even- 
tual success is practically guaranteed if the corresponding con- 
straints are strong enough to make needles distinct from hay. 
Perhaps most remarkable of all is the empirical fact that needles 
can be found in a reasonable time at all. Though the attractive 
basins of the difference map's fixed points are, by proper choice 
of parameters, tuned to be as large as possible (Elser, 2002a, 
2002b), this local optinoization cannot predict the average solu- 
tion time. With more experience we anticipate a body of empir- 
ical relationships between average solution times and charac- 
teristics of the data that can fill this gap. Some progress in this 
direction was recently achieved (Elser, 2002c) in a highly ide- 
alized version of the phase problem (Zwick et al , 1996), where 
the discretely sampled (one dimensional) density is known to 
be two- valued (a binary sequence). 
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Table 1 

Test structures solved by atom.retriever. 



structure 


ref. 


data 


refl's 


N 


group 




b 


grid 


/3 


iter's 


(cos A(p) 


montebrasite 


1 


neutron 


1024 


20 


PI* 


0.66 


2.5 


22 X 24 X 22 


0.7 


20 


0.76 


TEX nitramine 


2 


x-ray 


1911 


36 


PI* 


0.76 


6.6 


20 X 22 X 24 


0.7 


15 


0.77 


amphiphilic pyrrole 


3 


x-ray 


4238 


48 


PI* 


0.81 


11.0 


14 X 26 X 52 


0.7 


25 


0.76 


punctaporonin D 


4 


x-ray 


3764 


54 


PI 


0.85 


10.4 


22 X 26 X 30 


0.7 


75 


0.72 


a-helbt & 3io/a-helix 


5 


x-ray 


7489 


148 


PI 


0.96 


7.5 


22 X 36 X 42 


0.7 


1100 


0.71 


triphenylphosphine 


6 


x-ray 


9982 


152 


PI* 


0.84 


2.9 


28 X 36 X 42 


0.7 


25 


0.79 


a-helical bundle 


7 


x-ray 


23681 


479 


PI 


0.90 


10.1 


40 X 46 X 60 


0.7 


450000 


0.75 



* The actual structure has symmetry PI, but was treated as PI with twice as many atoms. 
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Figure 2 

Structure factors in the plane /i = for a 148-atom peptide structure (Table 1, ref. 5): (a) measured values; (b) bounds. 
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Figure 3 

Atomicity projections of a two dimensional density, (a) Density p witiiin one unit cell, (b) nA+(p), tlie density closest to (a) having five positive atoms, (c) Tlp^{p), 
same as (b) but with the signs of the atoms unspecified. 
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Figure 4 

Average number of difference map iterations required to solve a 148-atom structure (Table 1, ref. 5) for the parameter range 0.4 < /3 < 0.8 (over 200 solutions per 
point). 
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Figure 5 

Evolution of the error estimate e with difference map iteration for two 148-atom data sets: (top) synthetic data for ideal 3x3x3 atoms and no H-atoms, (bottom) 
experimental data (Table 1, ref. 5). 
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Figure 6 

Detailed behavior of the densities in the plane ^ = for the solution shown at the top of Fig. 5: (a-b) consecutive densities /o(lOO) and /o(lOl), (c) fixed point density 
p* Ri p(3000), (d) solution density p^oi = Uf o /a-|.(p*). 
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Figure 8 

atom.retriever solution for the neutron data set of the mineral montebrasite (Table 1); (left) near-monotonic decrease of the en'or with iteration, (right) scattering 
density Psoi in a plane of the structure showing an atom (Li) with negative scattering length (light contrast). 
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Figure 9 

at om.retriever solution of the synthetic a-heUcal bundle (Table 1); (top) evolution of the error estimate, (bottom) electron density contours in a plane containing 
the chlorine ion (top, left of center). 



